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Abstract. In a previous paper we described a system which recursively 
recovers a super-resolved three dimensional surface model from a set of 
images of the surface. In that paper we assumed that the camera cali- 
bration for each image was known. In this paper we solve two problems. 
Firstly, if an estimate of the surface is already knowm, the problem is to 
calibrate a new image relative to the existing surface mode!. Secondly, if 
no surface estimate is available, the relative camera calibration between 
the images in the set must be estimated. This will allow an initial surface 
model to be estimated. Results of both types of estimation are given. 


1 Introduction 

In this paper we discuss the problem of camera calibration, estimating the posi- 
tion and orientation of the camera that recorded a particular image. This can be 
viewed as a parameter estimation problem, w’here the parameters are the camera 
position and orientation. We present two methods of camera calibration, based 
on two different views of the problem. 

1. Using the entire image /, the parameters are estimated by minimizing {I - 
/(6>)) 2 , w'here & are the camera parameters and I\0 ) is the image simulated 
from the (assumed known) surface model. 

2. Using features extracted from the image, the parameters are estimated by 
minimizing ( u — u(@)) 2 , where u(0) is the position of the estimated feature 
projected into the image plane. 

Under the assumption that a surface model is knowm, the first method has a 
number of advantages. It makes no assumptions about the size of the displace- 
ments between the images; it gives much more accurate estimation as many 
thousands of pixels axe used to estimate a very few camera parameters; most 
fundamentally for our problem, it does not require feature extraction - images 
of natural scenes often do not have sharp corner features required for standard 
approaches to camera calibration. 

In an earlier paper [1] we described a system that inferred the parameters of 
a high resolution triangular mesh model of a surface from multiple images of that 
surface. It proceeded by a careful modeling of the image formation process, the 
process of rendering , and showed how the rendering process could be linearized 
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with respect to the parameters of the mesh fin that case, the height and albedo 
values). This linearization turned the highly nonlinear optimization for the mesh 
parameters into the tractable solution of a very high dimensional set of sparse 
linear equations. These were solved using conjugate gradient, using iterative 
linearization about the estimate from the previous iteration. 

The work in [1] required that the camera parameters (both internal and ex- 
ternal) were known, and also assumed that the lighting parameters were known. 
In this paper we continue to assume that the internal parameters are known - 
NASA mission sensors are extensively calibrated before launch - and that the 
lighting parameters are known. Here we will describe how the linearization of 
the rendering process can be performed with respect to the camera parameters, 
and hence how the external camera parameters can be estimated by minimizing 
the error between the observed and synthesized images. We assume the usual 
pinhole camera model [10]. 

To estimate the camera parameters as described above requires that a surface 
model is already available. For the initial set of images of a new region, no surface 
model is available. In principle one could optimize simultaneously over both the 
surface parameters and the camera parameters. In practice, because the camera 
parameters are correlated with all the surface parameters, the sparseness of the 
set of equations is destroyed, and the joint solution becomes computationally 
infeasible. Instead, for the initial set of images we use the standard approach of 
feature matching, and minimize the sum squared error of the distance on the 
image plane of the observed feature and the projection of the estimated feature 
in 3D. 

A surface can be inferred using the camera parameters estimated using fea- 
ture matching. New images can then be calibrated relative to this surface esti- 
mate, and used in the recursive update procedure described in [1]. 


2 Calibration by Minimizing the Whole Image Error 


Consider a surface where the geometry is modeled by a triangular mesh, and 
an albedo value is associated with each vertex of the mesh. A simulated camera 
produces an image / of the mesh. The camera is modeled as a simple pinhole 
camera, and its location and orientation is determined by six parameters, its 
location in space (x cy y c , z c ), the intersection of the camera axis with the x - y 
plane, (xo?yo)> and the rotation of the camera about the camera axis, <p. The 
last three of these parameters can be replaced by the three camera orientation 
angles, and we will use both representations in different places, depending on 
which is more convenient. These parameters are collected into a vector 0. 

For a given surface, given lighting parameters, and known internal camera 
parameters, the image rendered by the synthetic camera is a function of 0, 
ie / = 7(0). Making the usual assumption of independent Gaussian errors, 
and assuming a uniform prior on 0 reduces the maximum likelihood estimation 
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problem to a least-squares problem 

6=mm £(/„ - 7 P (0)) 2 (1) 

P 

where 0 is the maximum-likelihood estimate of the camera parameters. Because 
1(0) is in general a nonlinear function of (9, to make this estimation practical 
we linearize 1(0) about the current estimate Oq 

/(0)=/(©o) + Dx; D = j^j (2) 

where D is the matrix of derivatives evaluated at Qq . and x = 0 - Go- This re- 
duces the least-squares problem in equation 1 to the minimization of a quadratic 
form, Fo(x) f 


F 2 (x' = ^xDD r x r - bx (3) 

b = (/ — I(0))D (4) 

w’hich can be solved using conjugate gradient or similar approaches. In the fol- 
lowing section w r e will describe how an object space Tenderer can also be made 
to compute D, the derivatives of the pixel values with respect to the camera 
parameters. 

2.1 Forming the image 

As discussed in [1], to enable a Tenderer to also compute derivatives it is necessary 
that all computations are done in object space. This implies that the light from 
a surface triangle, as it is projected into a pixel, contributes to the brightness of 
that pixel with a weight proportional to the fraction of the area of the triangle 
w f hich projects into that pixel The total brightness of the pixel is thus the sum of 
the contributions from all the triangles whos projection overlaps with the pixel. 

4 = £/a*a. (5) 

A 

where /£ is the fraction of the flux that falls into pixel p, and # & is the total 
flux from the triangle. This is given by 

= pF(a') cosa y cos *0 AQ, (6) 

E(a') = A(F cos a' +Z a ). 

Af2 = S/d 2 . 

Here p is an average albedo of the triangular facet. Orientation angles a s and 
a v are defined in figure 1. is the total radiation flux incident on the 

triangular facet with area A. This flux is modeled as a sum of two terms. The first 
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Fig. 1. Geometry of the triangular facet, illumination direction and viewing direction. 
Zj is the vector to the illumination source; z v is the viewing direction. 


term corresponds to direct radiation with intensity I s from the light source at 
infinity (commonly the sun). The second term corresponds to ambient light with 
intensity I 1 . The parameter 8 in equation (6) is the angle between the camera 
axis and the viewing direction (the vector from the surface to the camera); k is 
the lens falloff factor. .4/? in (6) is the solid angle subtended by the camera which 
is determined by the area of the lens 5 and the distance d from the centroid of 
the triangular facet to the camera. If shadows are present on the surface the 
situation is a little more complex. In this paper we assume that there are no 
shadows or occlusions present. 


2.2 Computing image derivatives with respect to camera 
parameters 


Taking derivatives of the pixel intensities in equation 5 gives 


dl p 

dO l 
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+ 
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Consider first We neglect the derivatives with respect to the fall-off 

angle, as their contribution will be small and so it is clear from equation 6 that 
the derivative with respect to any of the camera orientation singles is zero. 

The derivative with respect to the camera position parameters is given by 
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= ^(Zi - Z v (z v .Zi)) 
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Fig. 2. The intersection of the projection of a triangular surface element (io , ii . 12 ) onto 
the pixel plane with the pixel boundaries. Bold lines corresponds to the edges of the 
polygon resulting from the intersection. Dashed lines correspond to the new positions 
of the triangle edges when point Pl 0 is displaced by <5P 


where v is the vector from the triangle to the camera, v = [v|, 0, are the three 
components of the camera position, £ t are unit vectors in the three coordinate 
directions and z v is a unit vector in the direction of the illumination (see figure 

i). 

d f p 

Now consider For triangles that fall completely within a pixel. the sec- 
ond term in equation 7 is zero, as the derivative of the area fraction is zero. For 
triangles that intersect the pixel boundary, this derivative must be computed. 
When the camera parameters change, the positions of the projections of the mesh 
vertices into the image plane will also move. The derivative of the fractional area 
is given by 


J_ y- ( d.jppiygon _ /m 

dO l A& . . I 9Pj Ja 3Pj j d6i ' V 

J=lo.lx,la V ' 

where Pj is the projection of point Pj onto the image plane, and A& is the 
area of the projection of the triangle. The point displacement derivatives will be 
detailed below. 

Thus, the task of computing the derivative of the area fraction (9) is reduced 
to the computation of dA&/d Pj and <9.4 p0 | y g 0n /<9Pj. Note that the intersection 
of a triangle and a pixel for a rectangular pixel boundary can, in general, be a 
polygon with 3 to 7 edges with various possible forms. However the algorithm 
for computing the polygon area derivatives that we have developed is general, 
and does not depend on a particular polygon configuration. The main idea of 
the algorithm can be described as follows. Consider, as an example, the polygon 
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Fig. 3. Illustration of the geometry for determining the rotation between world and 
camera coordinates. 


shown in figure 2 which is a part of the projected surface triangle with indices 
io, ii, l 2 We are interested in the derivative of the polygon area with respect to 
the point Pj 0 that connects two edges of the projected triangle, (Pi a ,Pj 0 ) and 
(PiQ.Pii)- These triangular edges contain segments (I, J) and (K, L) that are 
sides of the corresponding polygon. It can be seen from figure 2 that when the 
point . Pi 0 is displaced by <5Pi 0 the change in the polygon area is given by the 
sum of two terms 

^polygon = <^I,J + <5.4 K ,l 

These terms are equal to the areas spanned by the two corresponding segments 
taken with appropriate signs. Therefore the polygon area derivative with respect 
to the triangle vertex Pi 0 is represented as a sum of the two “segment area” 
derivatives for the 2 segments adjacent to a given vertex. The details of this 
computation will be given elsewhere. 

We now' consider the derivatives of the point positions. 


Derivatives of the position of the projection of a point on the image 
plane The pinhole camera model gives 


[AR(P - t)], 
[AR(P - t)] 


where R is the rotation matrix from world to camera coordinates, t is the trans- 
lation of between camera and world coordinates and A is the matrix of camera 
internal parameters [13]. In this work we assume that the internal camera pa- 
rameters are known, and further that the image plane axes are perpendicular, 
and that the principle point is at the origin. This reduces A to a diagonal matrix 
with elements (k u ,k Vl l). 

The rotation matrix R can be written in terms of the Rodrigues vector [10] 
Q = (gi, 02, Qz) which defines the axis of rotation, and 9 = |#| is the magnitude 
of the rotation. (Clearly q can be written in terms of the camera position, the 
look-at point and the view-up vector.) 




(id 


wh«.TC 


0 -Qi Qn 
H = | Qz 0 -Qi 
-On 0[ 0 


k 12) 


Let H = z H and = ~g L then 


9 

<9R _ sin# 

del ~ ~ Hl ~ 


+ H~ sin 9 - 2 


+ (HHi + HiH) 

1 - cos 8 


(1 - cos#) 

# 


„( coss -^) r , Il3 ) 


9 


where Hi = $&. Then 

5u,_ /'[f?< p -‘)] I IR(P-‘>].[557< p -‘>],'\ 

dm “^(R(p-t)j, ([R(p-t)],i» J 


The derivatives with respect to the position parameters are 

dui [AR(P - t)] < [AR] 3 , J - [AR]i.j 

dtj ([AR(P - 1)] 3 ) 2 [AR(P - 1)] 3 1 ° j 

In practice, optimization using the camera orientation angles directly is inad- 
visable, as a small change in the angle can move the surface a long distance in 
the image, and because the minimization in equation 1 is based on a sum over 
all pixels in the image, this can make for rapid changes in the cost, and failure 
to converge. Instead we use the i; look-at” point (xo,yo) which is in the natural 
length units of the problem. The conversion of the derivatives from angles to 
look-at is an application of the chain rule, and is not detailed here. 

We now consider the second problem, calibration using features detected in 
the images. 


3 Calibration by Minimizing Feature Matching Error 

It is well known that camera calibration can be performed using corresponding 
features in two or more images [4]. This estimation procedure also returns the 
3D positions of the corresponding image features. So the parameter space is 
augmented from 0* (where / indexes the frame, or camera parameter set) with 
P n , the positions of the 3D points. 

If it is assumed that the error between u£, the feature located in image / 
corresponding to 3D point P n , and u£, the projection of P n into image / using 
camera parameters 0* , is normally distributed, then the negative log-likelihood 
for estimating & find P is 

L(e{,P n ;i = 1...6J = l...k',n = l...N) = f2 E EKn~Kn) 3 (16) 

/=! n€!?/ 1=1 
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Fig. 4. Four synthetic images of an area of Death Valley 


where 1?/ is the set of features that are detected in image /. Note that the 
features detected in a given image may well be a subset of all the P n ’s. / indexes 
the components of u. This form of the likelihood assumes that there is no error 
in the location of the features in the images. 

Typically the non-linear likelihood in equation 16 is minimized using a stan- 
dard non-linear minimization routine, for example the Levenberg-Marquardt al- 
gorithm [11]. The dimensionality of the parameter space in equation 16 is large, 
equal to 6x the number of images + 3x the number of 3D points, and there 
will in general be many points. Because of this large dimensionality it is impor- 
tant to use exact derivatives, to avoid slow convergence and reduce the need for 
good initialization. In section 3.2 below we derive the analytic derivatives of the 
likelihood function, enabling this robust convergence. 

The other practical problem in using feature-based camera calibration is the 
detection and matching of image features. 

3.1 Robust feature matching 

The maximum likelihood solution to the camera parameter estimation problem 
is known to be extremely sensitive both to mismatches in the feature correspon- 
dences, and to even small errors in the localization of the detected features. To 



0 



Fig. 5. “Strength” of the features found in image 0 


reliably estimate the camera parameters we need reliably located features, reli- 
ably matched. More accurate estimation results from using a smaller set of well 
localized and well matched features, than a much larger set that includes even 
a single outlier. Extreme conservatism in both feature detection and matching 
is needed. 

The feature detector most commonly used is the Harris corner detector [2]. 
This feature detector was developed in the context of images of man-made envi- 
ronments, which contain many strong corners. Remote sensed images of natural 
scenes of the type we are concerned with (see figure 4) contain some, but much 
fewer, strong corner features. If the U feature strength’ 1 given by the Harris detec- 
tor is plotted for the features detected on the images in figure 4, it can be seen 
to fall off rapidly - see figure 5. For this type of image, it is therefore necessary 
to use only a small number of features, where the associated feature strength is 
high enough to ensure accurate feature localization. This is a limiting factor in 
the use of feature based calibration for this type of images, and a motivation for 
developing the wdiole image approach described above. Feature detectors more 
suited to natural scenes are clearly needed, but there will always be particularly 
mute environments where feature based methods will fail. 

Because of the extreme sensitivity to mismatches, it it necessary to ensure 
that the features found are matched reliably. This is a classic chicken-and-egg 
problem: to determine reliable matches it is necessary to correctly estimate the 
camera parameters, and to correctly estimate the camera parameters requires 
reliable matches. For this reason, feature matching has spawned a number of 
methods based on robust estimation (and one approach which attempts to do 
away with explicit feature matching altogether [12]). 

The RANSAC algorithm [3] finds the largest set of points consistent with a 
solution based on a minimal set, repeatedly generating trial minimal sets until 
the concensus set is large enough. Zhang [4] also bases his algorithm on esti- 
mation using minimal sets, but uses LMedS (Least Median Squares) to select 
the optimal minimal set. The estimate of the fundamental matrix [10] generated 
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using that minimal set is used to reject outliers. Zhang also applies a relaxation 
scheme to disambiguate potential matches. This is a formalization of the heuris- 
tic that features nearby in one image are likely be close together in another 
image, and in the same relative orientation. In our work we use a modifica- 
tion of Zhang’s algorithm described below. Torr and co-w’orkers have developed 
MLESAC [7] and IMPSAC [6] as improvements on RANSAC. IMPSAC uses a 
multiresolution framework, and propagates the probability density of the camera 
parameters between levels using importance sampling. This achieves excellent re- 
sults, but was considered excessive for our application, where prior knowledge 
of the types of camera motion between frames is known. 

Our algorithm proceeds as follows: 

1. Use the Harris corner detector to identify features, rejecting those which 
have feature strength too low to be considered reliable. 

2. Generate potential matches using the normalized correlation score between 
window's centered at each feature point. Use a high threshold (w'e use ^ = 0.9) 
to limit the number of incorrect matches. 

3. Use LMedS to obtain a robust estimate of the fundamental matrix: 

- Generate an 8 point subsample from the set of potential matches, where 
the S points are selected to be widely dispersed in the image (see [4]). 

- Estimate the fundamental matrix. Zhang uses a nonlinear optimization 
based approach. We have found that the simple eight point algorithm [9] 
is suitable, because our features are in image plane coordinates, which 
correspond closely to the normalization suggested in [8]. 

- Compute the residuals, u r Fu for all the potential matches, and store 
the median residual value. 

- Repeat for a new subsample. The number of subsamples required to 
ensure with a sufficiently high probability that a subset with no outliers 
has been generated depends on the number of features and the estimated 
probability that each potential match is an outlier. 

- Identify F m j n as the fundamental matrix which resulted in the lowest 
median residual value. 

4. Use F m i n to reject outliers by eliminating matches which have residuals 
greater than a threshold value. (We used t m i n = 1.0 pixels.) 

5. Use the following heuristic to eliminate any remaining outliers: because the 
images are remote sensed images, the variations in heights on the surface 
are small in comparison to the distance to the camera. So points on the 
surface that are close together should move similar amounts, and in similar 
directions 1 . The heuristic is 

- Consider all features within a radius r = 0.2 of the image size from the 
current feature. The match found for that feature is accepted if both of 
the following conditions hold: 

1 This is not true if the camera moves towards the surface, but even then, if the 
movement towards the surface is not excessive, this heuristic still approximately 
holds. 
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(a) The length of the vector between the features is less than a thresh- 
old times the length of the largest vector between features in the 
neighbourhood. 

(b) The average distance between neighbouring features in one image is 
less than a threshold times the average distance between the same 
features in th second image. 

In both cases the threshold used was 1.3. 

Features are matches between all pairs of images in the set, and are used in the 
likelihood minimization (16) to estimate the camera parameters. Note that not 
all features will be detected in all the images in a set, so the likelihood will only 
contain terms for the features actually found in that image. 


3.2 Computing derivatives of the feature positions 

To effectively minimize L(&, P) in equation 16 we need to compute its deriva- 

/ dix ^ 

tives, which reduces to computing --V- and -rp r 1 . In what follows we will con- 

O&j OlT n 

centrate on one frame and drop the / index. 

We have already shown in equation 14 the expression for the derivative of the 
point position with respect to the rotation angles and camera position, which 

together make up . It remains only to give the expression for the derivative 

with respect to the 3D feature point. This is the same as the derivative with 
respect to the camera position, see equation 15, but w'ith the sign reversed, 
giving 

dui, n _ [AR]jj [AR(P„ - t)],[AR] 3 j 

dP n ,j ~ [AR(P„ - t)] 3 ([AR(P n - t)] 3 )" 

Where the subscript n indexes the features. 

4 Results and conclusions 

Figure 4 shows four synthetic images of Death Valley. The images were generated 
by rendering a surface model from four different viewpoints and with different 
lighting parameters. The surface model was generated by using the USGS Digital 
Elevation Model of the area for the heights, and using scaled intensities of a 
LANDS AT image as surrogate albedos. The size of the surface was approximately 
350 x 350 points, and the distance between grid points was taken as 1 unit. The 
images look extremely realistic. 

Table 1 shows the results of estimating the camera parameters using features 
detected in the images. These estimates are good, but far from exact. Considering 
the images in figure 4, it is clear that there are few strong corner features. Figure 
6 shows the set of strong features detected in image 0 (the top left image in figure 
4). Two things are apparent. Firstly, the features are all due to rapid changes in 
albedo. Secondly, with two exceptions, the features are clustered. This clustering 
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Fig. 6. Features found in image 0 


reduces the accuracy of the estimation. That the features are mostly albedo 
features confirms that feature based approaches are not applicable to many of 
the types of image we are interested in. 

Table 1 also shows the results for calibration using matching to a pre-existing 
3D surface model. The estimation was initialized at the results of the point- 
matching estimation. The minimization of (/-/(<9)) 1 2 was performed iteratively, 
re-rendering to compute a new I at the value of & at the convergence of the 
previous minimization. As expected, the estimates are very significantly better 
than the results from point matching, and are very accurate. However, these 
results are predicated on the existence of a surface model. 

These results suggest an approach to camera calibration that is the subject 
of our current, ongoing, research. Point matching can be used to estimate initial 
camera parameters. A surface can then be inferred using these parameters. The 
whole-image matching approach can be used to re-estimate the camera param- 
eters, and a new surface estimate can be made using the new camera parameter 
estimates. This process can be iterated. The convergence of this iterative proce- 
dure is currently being studied. 
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